HiFrames: High Performance Data Frames in a Scripting Language

Authors

  • Ehsan Totoni
  • Wajih Ul Hassan
  • Todd A. Anderson
  • Tatiana Shpeisman
Abstract

Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not tightly integrated with array computations (e.g., Spark SQL). This paper proposes a novel compiler-based approach in which we integrate data frames into the High Performance Analytics Toolkit (HPAT) to build HiFrames. HiFrames provides expressive and flexible data frame APIs that are tightly integrated with array operations. It then automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. We demonstrate that HiFrames is significantly faster than alternatives such as Spark SQL on clusters, without forcing the programmer to switch to embedded SQL for part of the program. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations, and can be up to 20,000x faster for advanced analytics operations, such as weighted moving averages (WMA), that the map-reduce paradigm cannot handle effectively. HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26 on 64 nodes of the Cori supercomputer.
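To make the combination of relational and array-style operations concrete, the following is a minimal sequential sketch in plain Python. It is not the HiFrames API and does not reflect how HiFrames compiles or parallelizes code; the toy column data, the helper names `filter_rows` and `weighted_moving_average`, and the 1-2-1 weights are all assumptions chosen for illustration. It shows a relational filter followed by a weighted moving average, where each output element depends on its neighboring rows, which is the kind of stencil-like pattern the abstract says map-reduce handles poorly.

```python
# Hypothetical toy data: a "data frame" stored as a dict of equal-length columns.
frame = {
    "id": [1, 2, 3, 4, 5, 6],
    "x": [1.0, 4.0, 2.0, 8.0, 6.0, 3.0],
}


def filter_rows(frame, column, predicate):
    """Relational filter: keep the rows whose `column` value satisfies `predicate`."""
    keep = [i for i, value in enumerate(frame[column]) if predicate(value)]
    return {name: [col[i] for i in keep] for name, col in frame.items()}


def weighted_moving_average(values, weights):
    """Centered WMA over an odd-length window.

    Border elements, where the full window does not fit, are copied unchanged.
    """
    half = len(weights) // 2
    total = sum(weights)
    out = list(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        out[i] = sum(w * v for w, v in zip(weights, window)) / total
    return out


# Relational step: select rows with id >= 2.
selected = filter_rows(frame, "id", lambda value: value >= 2)

# Array step: each output depends on the previous and next row of the result.
selected["wma"] = weighted_moving_average(selected["x"], [1.0, 2.0, 1.0])

print(selected["wma"])  # [4.0, 4.0, 6.0, 5.75, 3.0]
```

Because every output row reads from adjacent rows, a distributed version would need each worker to exchange boundary rows with its neighbors, rather than processing rows independently.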




Journal:
  • CoRR

Volume: abs/1704.02341

Publication date: 2017